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Abstract 

Background: Cork oak (Quercus suber) is one of the rare trees with the ability to produce cork, a material widely 
used to make wine bottle stoppers, flooring and insulation materials, among many other uses. The molecular 
mechanisms of cork formation are still poorly understood, in great part due to the difficulty in studying a species 
with a long life-cycle and for which there is scarce molecular/genomic information. Cork oak forests are of great 
ecological importance and represent a major economic and social resource in Southern Europe and Northern Africa. 
However, global warming is threatening the cork oak forests by imposing thermal, hydric and many types of novel 
biotic stresses. Despite the economic and social value of the Q. suber species, few genomic resources have been 
developed, useful for biotechnological applications and improved forest management. 

Results: We generated in excess of 7 million sequence reads, by pyrosequencing 21 normalized cDNA libraries derived 
from multiple Q. suber tissues and organs, developmental stages and physiological conditions. We deployed a stringent 
sequence processing and assembly pipeline that resulted in the identification of -159,000 unigenes. These were 
annotated according to their similarity to known plant genes, to known Interpro domains, GO classes and E.C. 
numbers. The phylogenetic extent of this ESTs set was investigated, and we found that cork oak revealed a significant 
new gene space that is not covered by other model species or EST sequencing projects. The raw data, as well as the 
full annotated assembly, are now available to the community in a dedicated web portal at www.corkoakdb.org. 

Conclusions: This genomic resource represents the first trancriptome study in a cork producing species. It can be 
explored to develop new tools and approaches to understand stress responses and developmental processes in forest 
trees, as well as the molecular cascades underlying cork differentiation and disease response. 
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Background 

Oaks (Quercus spp.) are important trees of the Northern 
hemisphere. In Europe they form highly valuable wide- 
spread forests. Together with chestnut and beech, oaks 
belong to the Fagaceae, and are probably the best-known 
genus of the family. The evergreen cork oak (Q. suber) 
grows in the Western Mediterranean Basin, having as nat- 
ural range Algeria, France, Italy, Morocco, Portugal, Spain 
and Tunisia, where it is managed under low-density an- 
thropogenic open woodland forests. Quercus spp. are im- 
portant for conservation of soil and water, biodiversity, 
natural landscape and climate, and for production of highly 
valuable materials, thus having high ecological, social and 
economic value. 

Quercus suber shares with Phellodendron amurense 
(Amur cork tree) and Q. variabilis (Chinese cork oak) 
the odd ability of producing a continuous and renewable 
out-bark of cork, although only Q. suber cork has the 
fine physical and chemical properties for a highly profit- 
able industrial use. 

Portugal owns the credits of the world leading position 
on cork oak forest area (740,000 ha out of the world 
2,200,000 ha), cork production (60% of the world exported 
cork volume), and cork processing (74% of world proc- 
essed cork). In Portugal, in the past, oaks used to domin- 
ate the native forests but their area has rapidly decreased 
as a result of human activity. Still, cork oak forests are 
accounting for about 26% of the Portuguese forest [1], 

However, cork oak (Q. suber) and holm oak (Q. ilex ssp. 
rotundifolia) decline reported in the Iberian Peninsula 
over the last 20 years has caused death of numerous trees, 
threatening the rural economy in this part of Europe 
[2-5]. It has been predicted that oak diseases in Europe 
could become more severe and expand to the North and 
East within the next few hundred years [6]. 

Nowadays, this species faces many other threats, such 
as drought, extreme temperature and pests, leading to a 
marked decline of cork oak stands, possibly related to the 
repeated successions of extremely dry and hot years with 
a significant reduction of springtime precipitation [7]. 

The relevance of Q. suber and the scarce information 
available on its genetics, biochemistry and physiology [8-14] 
fully justifies the generation of transcriptomics data that will 
allow a new insight on cork oak biology and genetics. These 
data are fundamental for designing selection programs 
and understanding the plant adaptation processes to both 
biotic and abiotic factors, plant's plasticity, ecophysiologi- 
cal interactions, interspecific hybridization and gene flow. 

For a species that has neither its genome sequenced, 
nor a physical map available, the information obtained 
from expressed sequence tags (ESTs) is a practical means 
for gene discovery and a way to start elucidating its physi- 
ology and functional genome. When this project started (in 
2010) there were less than 300 ESTs available for Q. suber. 



Recently, this number has increased to almost 7,000 (http:// 
www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). 

Other oak species have also been subjected to transcrip- 
tomic studies, namely two European white oak species (Q. 
petraea, sessile oak, and Q. robur, English oak) [15,16], two 
American oak species (Q. alba, white oak, and Q. rubra, 
red oak) (reviewed in [17]). Ueno et al. [15] generated 
222,671 non-redundant sequences (including alternative 
transcripts) from multiple cDNA libraries prepared from 
Q. petraea and Q. robur, which is a relevant resource for 
genomic studies and identification of genes of adaptive sig- 
nificance. In 2011, the same team produced another useful 
tool, a BAC library, for genome analysis in Q. robur [18]. 
Another important tool to develop a physical map for a 
Fagaceae species was based on the work of Durand and co- 
workers [19], who produced a total of 256 oak EST-SSRs 
that were assigned to bins and their map position was fur- 
ther validated by linkage mapping (http://www.fagaceae. 
org). More recently, [16] generated the larger-to-date set of 
reads from the transcriptome of an oak species (Q. robur), 
combining 454 and Illumina sequencing. 

Within a national initiative, Portugal organized a 
consortium to study cork oak ESTs (COEC - Cork oak 
ESTs Consortium, http://coec.fc.ul.pt/), where 12 pro- 
jects were designed to obtain a deeper understanding 
of Q. suber functional genomics. Developmental aspects 
(gametophytes, fruit and embryo development, acorn ger- 
mination, bud sprouting, vascular and leaf development), 
as well as cork formation and quality, and abiotic (oxida- 
tive stress, drought, heat, cold and salinity) and biotic 
interactions (including symbiosis and pathogenesis) were 
followed by 20 teams from all over the country. Two of 
these projects were fully dedicated to the bio-informatics 
analysis of the generated data and development of bio- 
informatics platforms, one of them further focusing on 
polymorphism detection and validation. 

This paper presents the experiments conducted for large- 
scale sequencing of 21 cDNA libraries and construction of 
a cork oak transcriptome database containing 159,000 
unigenes. Presently, this database constitutes one of the 
largest genomic resources available for oaks and was struc- 
tured to accommodate future data on genomics and physi- 
ology of woody species. The tools that were generated are 
crucial to study cork oak biology and diversity, and to 
understand gene regulation and adaptation to a changing 
environment. Future developments will make possible the 
early detection of traits of interest. This initiative will con- 
tribute to genomic research in cork oak and the Fagaceae 
family, paving the way for further studies. 

Results and discussion 

Sequencing 

We have constructed 21 libraries from Q. suber as de- 
scribed in Table 1. The libraries were constructed from 
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Table 1 Tissues and conditions used to produce the 


RNA libraries 




cDNAIibrary 


Library description 


L-1 


Phloem (adult trees) 


L-2 


Xylem (adult trees) 


L-3 


Abiotic stress: control (leaves) 


L-4 


Abiotic stress: cold (leaves) 


L-5 


Abiotic stress: heat (leaves) 


L-6 


Seed germination 


L-7 


Female flowers 


L-8 


Male flowers 


L-9 


Embryos from fruits at 4 developmental stages 


L-1 0 


Whole fruits at 7 developmental stages 


1 1 


Biotic Stress: roots (germinated acorns) infected 




by Phytophthora cinnamomi. 


L-1 2 


Biotic Stress: roots (thin white roots from 1 8-month-old 




plants) infected by Phytophthora cinnamomi. 


L-1 3 


Mycorrhizal symbiosis (roots). 


L-1 4 


Annual stems from cork producing Quercus 




suber x cerris hybrid trees 


L-1 5 


Annual stems from cork non-producing Quercus 




suber x cerris hybrid trees 


L-1 6 


Bud sprouting (bud phases 1 and 2). 


L-1 7 


Bud sprouting (bud phases 3 and 4). 


L-1 8 


Abiotic Stress: drought, salt and oxidative 




stresses (roots and shoots) 


L-1 9 


Leaves (from 8 locations for polymorphism detection) 


L-20 


High quality cork 


L-21 


Low quality cork 



All libraries were normalized. 



total RNA extracted from multiple tissues, developmental 
stages and stress conditions. Libraries were normalized by 
the Duplex-Specific Nuclease-technology [20], with the 
aim of increasing gene-space coverage and sequenced in a 
454 GS-FLX with Titanium Chemistry (Roche). A total of 
7,445,712 reads were produced, ranging from 40 to 587 bp, 
with an average length ranging between 185 and 310 bp 
(Table 2). An initial pre-processing step to remove contam- 
inants, low quality sequences and short sequences resulted 
in a reduction to nearly 5 million nuclear reads (4,968,463), 
with average lengths ranging between 209 and 321 bp 
(Table 2). Our approach resulted in a higher number and 
comparable read length as compared to other multi-library 
projects [Moser:2005ju; Ueno:2010bv; ONeil:2010bk; [21]]. 

Assembly 

A stringent assembly pipeline was implemented and is 
summarized in Figure 1. The assembly methodology is 
described in the Materials and Methods section, consist- 
ing of two stages: first each library was assembled indi- 
vidually, and secondly all assembled libraries were further 



assembled (assembly of assemblies). The choice of this 
two-step protocol lied in the asynchronous nature of the 
libraries being sequenced, and the need to deal with future 
libraries that are expected to be generated for other condi- 
tions and stress types. The choice of parameters in our 
protocol maximized the number of contigs and their 
length (in MIRA --AL:egp = no:mrs = 85 reduces gap pen- 
alties and permits longer matches; --AS:mrpc = 1 allows 
for single read contings, thus increasing the number of 
contigs), was extensively validated, and is described in 
greater detail in a companion paper (in preparation). We 
opted for de novo assembly, as the lack of a closely related 
species with a completely sequenced genome resulted in 
poor assembly (not shown). The assembly statistics for 
each library are shown in Table 2. A total of 577,852 puta- 
tive unigenes was achieved, including 501,257 contigs and 
76,122 singlets. Each library produced from 8,442 up to 
50,522 putative unigenes. These were all subjected to one 
additional assembly step (see Material and Methods sec- 
tion), which reduced the number of putative unigenes to 
approximately 159,298 unigenes. The final unigene length 
distribution is shown in Figure 2A. An average unigene 
length of 148.5 bp was found, which is smaller than those 
obtained in another oak using a combination the same 
sequencing platform with Sanger sequencing [15,16] 
(see Table 3). A BlastP of all the unigenes the NR data- 
base finds Plant best hits in 97.3% of the cases, with the 
remaining being hits to other species that are likely con- 
taminations not removed by our pipeline. A plot with 
the species distribution of these non-plant species is 
found on CorkOakDB.org. 

Coverage and depth 

The large number of libraries used, together with the 
choice of a two-step assembly, resulted in a high redun- 
dancy. Most of the nearly 5 million filtered ESTs were 
assembled into a large number of unigenes (~159 K). 
We obtained an average coverage depth of 3.9 (number of 
times each nucleotide was sequenced), with a maximum 
depth of 429 (25% percentile = 1; 75% percentile = 5). This 
is higher than other recent tree EST projects using the 
same sequencing platform (e.g. [22]), likely due to the ex- 
tensive number of libraries sequenced in this project, pre- 
pared from multiple tissues, developmental stages and 
stress conditions. After the two rounds of assembly, 61,687 
high quality reads remained unassembled and were treated 
as singletons. Thus, 65% of our unigenes derive from 
contigs, higher than other recent comparable projects 
(see Table nine in [15]). 

In the absence of a complete genome sequence, it is 
impossible to know the true coverage of the cork oak gene 
space offered by this project. However, when we queried 
the proteomes of Arabidopsis thaliana and Populus tricho- 
carpa using BLASTp to determine the potential number of 
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Table 2 Sequencing statistics 



Library 


Raw reads 




Processed reads 




Individual assemblies 




# 


<l> 


# 


<l> 


# total 


Contigs 


Singlets 


L-1 


392152 


200.2 


216861 


232.3 


30220 


26693 


3527 


L-2 


315360 


203.0 


208162 


237.6 


23962 


21499 


2463 


L-3 


182571 


193.6 


1 1 8708 


209.1 


16399 


15272 


1127 


L-4 


215084 


195.7 


147735 


210.8 


19573 


18060 


1513 


L-5 


1 53898 


185.2 


97870 


203.0 


14372 


13255 


1117 


L-6 


371060 


286.7 


279793 


304.5 


32700 


27735 


4965 


L-7 


346435 


235.1 


216309 


253.7 


30694 


28179 


2515 


L-8 


393501 


248.9 


285776 


264.2 


33550 


29758 


3792 


L-9 


524852 


295.0 


433762 


307.9 


48799 


37357 


11442 


L-1 0 


570370 


308.3 


449849 


321.8 


50522 


39471 


11051 


L-1 1 


220568 


273.4 


149645 


294.3 


18215 


17186 


1029 


L-1 2 


104517 


281.2 


73958 


298.3 


8442 


8188 


254 


L-1 3 


743576 


248.8 


411035 


263.7 


42318 


38830 


3488 


L-1 4 


413925 


271.2 


323372 


278.6 


38794 


34102 


4692 


L-1 5 


401170 


261.0 


321153 


269.2 


38359 


33447 


4912 


L-1 6 


320673 


259.2 


1 90983 


277.7 


21694 


19607 


2087 


L-1 7 


350843 


262.0 


203567 


282.3 


23857 


21989 


1868 


L-1 8 


774553 


254.5 


506642 


268.6 


46983 


41086 


5897 


L-1 9 


650604 


272.3 


333283 


288.9 


37926 


29543 


8383 



Processed Reads represents the number of nuclear sequences after the pre-processing (Figure 1). # stands for number, <l > for average length. 



unique genes detected, using a cut off of e < 10' 5 ' we found 
that 65% of cork oak unigenes hit 23,482 out of 27,379 pre- 
dicted proteins in A. thaliana (85%), and 30,318 out of 
45,555 in P. trichocarpa (67%) [23]. These numbers repre- 
sent a rough estimate of the upper (85%) and lower (67%) 
boundaries one can expect from the Q. suber transcrip- 
tome coverage. This figure doesn't change significantly if 
we use a more lenient cut off of e < 1CT , where we hit 
24,093 (79%) and 30,719 (67%), respectively. A high degree 
of redundancy in our unigenes is suggested, as multiple 
unigenes hit the same target genes in either species. The 
remaining 55,921 unigenes cannot find any hit in either 
A. thaliana or P. trichocarpa, representing about 35% of 
the cork oak transcriptome. These include small uni- 
genes that would not achieve significance in BLASTp 
comparisons (see Figure 2A), as well as potential novel 
genes not present in these two genomes. This number 
could be eventually overestimated, if we consider some 
under-assembly in our libraries. 

We performed a serial clustering at increasing levels of 
identity in order to evaluate the degree of redundancy in 
our assembly (Figure 2C). We found that at the protein 
level, there was a sharp decrease in the number of clus- 
ters at 95% identity, indicating that approximately 8000 
predicted peptides show a high identity between each 



other, comparable to that found in other oak species 
[15]. This could indicate a recent event of polyploidiza- 
tion giving rise to many highly similar genes. Alterna- 
tively, and probably most likely, this could be accounted 
by the high genetic diversity among the multiple unre- 
lated trees used to prepare the libraries [9]. Sequencing 
errors not fully resolved due to the relatively low cover- 
age of many unigenes could also be responsible for this 
result. In the first scenario our decision to filter off re- 
dundancies at the cDNA level at 98% could have been 
excessive, leading to the underestimation of the pre- 
dicted number of unigenes. In contrast, the second and 
third scenarios would suggest that 95% is insufficient 
and we are overestimating the number of unigenes that 
may be closer to 151,000. We do not have enough data 
to favour any of these scenarios, in particular because all 
three may co-exist. We have thus chosen the 98% cDNA 
clustering as a conservative parameter that we hope does 
not over-cluster paralogues. With future data accumula- 
tion, it will be easier to fuse unigenes than to resolve in- 
correctly clustered paralogues. 

Functional annotation 

We mapped the cork oak unigenes to the functional 
classes defined in Gene Ontology (GO) [24]. We had 
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Figure 1 Schematic representation of the bioinformatics 
pipeline, indicating the software used at each step. 



73,766 sequences mapped to at least one GO term and 
the unigenes covered a total of 2,273 different GO terms. 
Each unigene mapped to 3.66 terms on average. The vast 
majority of terms is present at low frequency, with a few 
functional classes dominating. The Biological process 
"Metabolism" was the most frequent, with other metabolic 
categories in the top five categories - metabolism related 
categories cover 68% of the terms assigned (Figure 3). 
Consistently, enzyme functions dominate the Molecular 
Functions ("Catalytic activity", "Transferase activity", 
"Hydrolase activity") (Figure 3). These are in contrast with 
the combined ESTs of two other oaks, Q. petraea and Q. 
robur, where the classes Transport (Biological Process) 
and Nucleotide Binding (Molecular Function) dominate 
[15]. Note, however, that this difference may simply lie in 
the fact that in that study non-normalized libraries were 
used, resulting in under-representation of lowly expressed 
genes. Furthermore, this difference may also lie in the fact 
that in that study, nuclear and organelle transcriptomes 
were, to the best of our knowledge, assembled together, 
while we removed both chloroplast and mitochondrial 
sequences from our assembly. This is supported by the 
observation that in the GO Cellular Component classi- 
fication, the "Plastid" class is the most frequent in the 
Q. petraea/ Q. robur ESTs, while in the cork oak, intra- 
cellular classes dominate ("Cell", "Intracellular", "Cyto- 
plasm", etc.) (Figure 3). 

We used a simple and conservative scheme for gene 
naming of the cork oak unigenes. Besides its accession 
number (see below for details), we gave it an unigene 
name based on its similarity to proteins in A. thaliana 
and P. trichocarpa (Table 4). We observed that for nearly 
40% of the unigenes we could not assign a clear annota- 
tion at cut off of e < 1CT (Figure 4), consistent with the 
number of unigenes that are not similar to any gene 
in other model plants. Conversely, we could identify 



(A) 



7,445,712 raw ESTs (19 Libraries) 

Pre-processing 



(B) 



(C) 



4,968,463 ESTs (processed nuclear reads) § 
I individual assemblies >, 
A, clustering c s 

833,767 contigs + singletons + singlets 

■ P = 

clustering £ | 

I multilibrary assembly 




159,298 unigenes 



0 



n 



P. trichocarpa 



200 600 1000 1600 2200 2800 3400 4000 4600 



Unigene Length Reads/Unigene Threshold (% identity) 

Figure 2 Assembly and predicted peptide statistics. (A) Unigene length distribution after multi-library assembly. There are 12 additional 
unigenes longer than 4600 bases, not shown on the plot, with the longest one being 9189 bases. (B) Unigene coverage (reads per unigene). 
(C) Serial clustering of predicted proteins based on the cork oak unigenes, and of the predicted proteins from the genomes of two model 
plant species. 
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Table 3 Assembly metrics of this project compared with those of two large oak transcriptome sequencing projects 





Q. suber (this study) 


Q. petraea/Q. robur [1 5] 


Q. robur [16] 


Sequencing platform 


454 


454 + Sanger 


454 + lllumina 


Libraries 


21 


14 (454) + 20 (Sanger) 


16 (454) + 8 (lllumina) 


Total reads 


7,445,712 


1,578,192 (454) + 145,827 (Sanger) 


821,534 (454) + 255,237,702 (lllumina) 


Contigs & single reads 


1 59,298 


222,671 


65,712 


mean length 


148.5 


235.8 


1003 



Biological Process 



Molecular function 



Cellular Component 






O metabolism 

nucleobase, nucleoside, nucleotide and nucleic acid r 
9 biosynthesis 

cell organization and biogenesis 
C development 

• transport 
organelle organization and biogenesis 
lipid metabolism 

9 carbohydrate metabolism 

protein metabolism 
9 catabolism 
9 cell cycle 

response to stress 

morphogenesis 

• reproduction 
protein modification 

9 cell differentiation 

signal transduction 

DNA metabolism 

secondary metabolism 

response to abiotic stimulus 

cytoskeleton organization and biogenesis 
. | ion transport 

response to endogenous stimulus 

growth 

response to biotic stimulus 
response to external stimulus 
protein transport 

generation of precursor metabolites and energy 

embryonic development 

regulation of gene expression, epigenetic 

death 

behavior 

cell death 

cell communication 

cell homeostasis 
9 cell growth 

protein biosynthesis 
9 viral life cycle 

mitochondrion organization and biogenesis 
9 cell recognition 
9 cell-cell signaling 

cell proliferation 

Figure 3 Gene Ontology classification of nuclear unigenes. Classification was performed using CateGOrizer, counting single occurrences and 
the Generic GO Slim [25]. Percentages are shown down to 3% only, and the functional classes are ordered by frequency. 



catalytic activity 
transferase activity 
hydrolase activity 
transporter activity 
binding 

protein binding 
kinase activity 
nucleic acid binding 
enzyme regulator activity 
nuclease activity 
ion channel activity 
peptidase activity 
signal transducer activity 
DNA binding 
protein kinase activity 
nucleotide binding 
receptor activity 
RNA binding 
lipid binding 

cytoskeletal protein binding 
receptor binding 

phosphoprotein phosphatase activity 

carbohydrate binding 

translation regulator activity 

translation factor activity, nucleic acid binding 

motor activity 

antioxidant activity 

chromatin binding 

structural molecule activity 

transcription regulator activity 

actin binding 



• cell 
intracellular 

9 cytoplasm 
. nucleus 

• cytoskeleton 
9 chromosome 

• plastid 

cytoplasmic membrane-bound vesicle 
9 thylakoid 

mitochondrion 
9 nucleoplasm 
9 vacuole 

Golgi apparatus 

endoplasmic reticulum 
9 extracellular region 

nuclear chromosome 
9 ribosome 
9 plasma membrane 

external encapsulating structure 

endosome 

peroxisome 

nuclear membrane 
9 cell wall 

O microtubule organizing center 
cell envelope 
lysosome 
lipid particle 
cilium 
nucleolus 

extracellular matrix (sensu Metazoa) 
cytoplasmic chromosome 
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Table 4 Unigene naming criteria are as follows 



Method Assignment 



BDBH 




Ortholog 


BLASTp search 






Alignment length 


identity 




> 85% 


> 35% 


High confidence 


> 70% 


> 25% 


Homolog 


< 70% 


> 30% 


Conserved domain 


< 70% 


< 30% 


Low confidence 



If a gene is bi-directional best hit (BDBH) of X in A. thaliana (or P. trichocarpa), 
we term it ortholog of X; if it is similar to X in A. thaliana (or P. trichocarpa) 
using BLASTp and it aligns in 85% of its length with more than 35% identity, 
we term it a High confidence X in Q. suber, etc. 



conserved domains in 44% of the unigenes, and could 
establish clear homology relationships to an additional 
16% of the unigenes, in a total of 60% unigenes with 
clear functional assignments in GO. 

We were able to map Interpro domains to 108,341 
unigenes (68%). Nearly half of the domains were wide- 
spread in evolution, being present in both Eukaryota and 
Bacteria (Figure 5). The other half was dominated by 
general Eukaryotic domains and less than 10% of the do- 
mains were plant specific. These results are comparable 
to those reported for the complete genomes of A. thaliana, 
P. trichocarpa and P. persica genomes, as well as to those of 
the transcriptomes of the closely related Quercus robur and 
Castanea mollissima which are also depicted in Figure 5. 

Evolution 

We compared the gene content of the cork oak, as esti- 
mated by our EST sequencing project, with that of 31 
completely sequenced plant genomes. We used BLASTp 




Figure 4 Distribution of annotation classes in the cork oak 
translated unigenes. 



at e < 10~ 5 and also at the permissive cut off of e < 10~ 2 
to determine how many predicted proteins in those spe- 
cies are similar to at least one cork oak unigene. The re- 
sults of this analysis are shown in Figure 6, indicating a 
broad concordance with the generic taxonomic/evolu- 
tionary distance of the species. This result does not 
change when we use a more permissive cut off of e < 10 
(not shown). 

We compared the unigenes derived from the cork oak 
with those of the red oak (Q. rubra), the pedunculate 
Oak (Q. robur - also known as English or French oak) 
and the Chinese chestnut (Castanea mollissima). For 
this comparison, the data from the Fagaceae Genome 
Web was used, for Q. rubra and C. mollissima which in- 
clude multiple tissues also sequenced using the 454 py- 
rosequencing platform (www.fagaceae.org/node/87455 
and www.fagaceae.org/node/181796/, respectively), and 
the data for Q. robur, which included 454 and Illumina 
generated sequences, and was obtained from www.ufz. 
de/trophinoak/index.php?de=31205 [16,26]. We used our 
own assembly pipeline on these sequences to ensure that 
no additional differences were introduced on methodo- 
logical grounds. The comparison is shown in Figure 7. 
The total number of distinct unigenes is higher in the 
cork oak project, probably reflecting the higher number 
of tissues and conditions sampled in our libraries, as well 
as incomplete assembly due to library biases and genetic 
heterogeneity of the samples. We verified that between 
77% and 82% of the unigenes from those species are 
similar to at least one unigene in the cork oak, as ex- 
pected from evolutionarily close species. The remaining 
18% - 23% of the unigenes of the red and english oaks 
and chestnut tree are likely species-specific, but may also 
be partially accounted by an incomplete coverage of the 
Q. suber. The large number of cork oak unigenes that 
does not find a hit in the other transcriptomes (30% - 
44% at e < 10 ) does however suggest that, most likely, 
this is not a major factor. This cork-oak-specific set rep- 
resents a mixture of small reads that fail to attain statis- 
tical significance (e.g. from incomplete assembly), as well 
as a putative set of cork oak-specific genes. Note that 
when we compare Q. suber with a completely sequenced 
genome of the Prunus persica, 94% of the P. persica 
genes find a hit in Q. suber, further suggesting that in- 
complete coverage of the gene space was probably not a 
major problem of our project. 

Database and interface 

To support the assembly and annotation pipeline we 
have a data warehouse system that records the data and 
metadata associated with each step of the pipeline. This 
is described in a companion paper (in preparation). From 
this warehouse we generated a public portal as a commu- 
nity resource for cork oak genomics, which is found at 
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Figure 5 Unique Interpro domains assigned to the Q. suber unigenes and two other transcriptomes for Q. robur and Castanea 
mollissima, as well as for species with completely sequenced genomes A. thaliana, P. trichocarpa and P. persica. 



http://www.corkoakdb.org. The assembled genes, the pro- 
teins they encode, and the functional annotations are 
made accessible through a web interface, partially shown 
in Figure 8. The gene view features sequence data, cDNA 
and protein, as well as plots of base-by-base coverage in- 
formation for the unigene. Users are shown pre-computed 
phylogenetic profiles against other plants according to two 
distinct methods, the bi-directional best BLAST hit and 
the inparanoid, two standard methods to identify ortho- 
logs and paralogues [27]. The gene view further includes 
functional annotations, namely GO annotations, Interpro 
domain assignments, KEGG pathways and best BLAST 
hits against general and plant-specific databases. Genes of 
interest can be discovered by searching specific fields or 
by running a nucleotide or protein BLAST search against 
the Cork Oak database. 

Conclusions 

We have developed the first large-scale library for the 
cork oak, an important economic resource in Southern 
Europe and North of Africa. We carried out a preliminary 
analysis of its gene content and functional annotation, and 
built a public platform for data sharing. Nineteen different 
libraries were sequenced, covering genes expressed in mul- 
tiple tissues, developmental stages and stress conditions. 



Our results suggest that we covered a large fraction of the 
cork oak gene space. Many of its unigenes are dissimilar to 
any other plant genes. These likely represent incomplete 
assemblies due to library biases, but may also include sev- 
eral true cork-oak specific genes, which once identified will 
represent a promising avenue to understand the molecular 
basis of the response leading to cork formation. We believe 
that this sequencing effort will enable the community to 
explore the molecular basis of the cork oak physiology, as 
well as its responses to the multiple abiotic and biotic chal- 
lenges that the cork oak forest is currently experiencing. 

Methods 

Samples, collection and preparation 

Within this initiative, in order to guarantee high tran- 
script coverage and to increase gene diversity, total RNA 
was isolated from Quercus suber biological samples ob- 
tained from different organs and tissues at varying de- 
velopmental stages (roots, leaves, buds, flowers, fruits, 
phellogen, vascular tissue, good and bad quality cork), 
as well as from plants that had been exposed to infec- 
tion with Phytophthom cinnamomi, symbiosis with Piso- 
lithus tinctorius mycorrhizal fungus and different abiotic 
stresses (cold, heat, drought, salinity and oxidative stress). 
Furthermore, total RNA was also isolated, at two distinct 



Pereira-Leal et ol. BMC Genomics 2014, 15:371 
http://www.biomedcentral.com/1471-2164/15/371 



Page 9 of 14 



Setaria italica 

- i Sorgum bicolor 

I Zea mays 

I Brachypodium distachyon 

Oryza sativa 
Aquilegia coerulea 

Prunus persica 

r I Malus domestica 

I Fragaria vesca 

Cucumis sativus 

Populus trichocarpa 

I Glycine max 

Medicago truncatula 

, Eucalytpus grandis 

Theobroma cacao 

i Citrus sinensis 

' Citrus Clementina 

I Carica papaya 

I Arabidopsis thaliana 

Arabidopsis lyrata 

Vitis vinifera 

Selaginella moelendorfii 

Physcomitrella patens 

Ostreococcus sp. 

Ostreococcus tauri 

Ostreococcus lucimarinus 

Micromonas sp. 

Chlamydomonas reinhardtii 

Volvox carteri 

Chlorelia vulgaris 

Chlorelia sp. 

Coccomyxa sp. C-169 



0 10000 20000 30( 

Number of BLAST Hits 

Figure 6 Number of the cork oak's predicted peptides unique BLAST hits in other plant genomes. 
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Figure 8 CorkOakdb.org. Screenshot of the top part of the gene view. 



dates (May and September), from annual shoots of 30 years 
old Quercus suber x cerris hybrid trees that either pro- 
duce or don't produce cork, in order to cover different 
developmental stages of the phellogen meristem. No ap- 
proval or licenses were required for sample collection. 
In each library, plant material from half-siblings (e.g. 
abiotic and biotic stress libraries) or from several unre- 
lated trees was used. All the plant material used was 
from Portuguese trees except for those trees used to de- 
tect polymorphism, which were from different Mediter- 
ranean countries [28]. The detailed conditions applied 
in each situation are described in www.corkoakdb.org/ 
libraries. The full set of libraries is described in Table 1. 



cDNA preparation, library normalization and 
pyrosequencing 

Total RNA from each tissue/condition was used as the 
source of starting material for cDNA synthesis and pro- 
duction of normalized cDNA libraries intended for 454 
sequencing. Briefly, the total RNA quality was verified 
on Agilent 2100 Bioanalyzer with the RNA 6000 Pico 
kit (Agilent Technologies, Waldbronn, Germany) and 
the quantity assessed by fluorimetry with the Quant-iT 
RiboGreen RNA kit (Invitrogen, CA, USA). A fraction 
of 1-2 ug of total RNA was used for cDNA synthesis 
with the MINT cDNA synthesis kit (Evrogen, Moscow, 
Russia), a strategy based on the SMART double-stranded 
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cDNA synthesis methodology using a modified template- 
switching approach that allows the introduction of known 
adapter sequences to both ends of the first-strand cDNA. 
Amplified cDNA was then normalized with TRIMMER 
cDNA Normalization kit (Evrogen, Moscow, Russia) using 
the Duplex-Specific Nuclease-technology [20,29]. 

Normalized cDNA was quantified by fluorescence and se- 
quenced in 454 GS FLX Titanium according to the stand- 
ard manufacturers instructions (Roche-454 Life Sciences, 
Brandford, CT, USA) at Biocant (Cantanhede, Portugal). 

Sequence processing and assembly 

The implemented sequence analysis strategy included an 
initial pre-processing stage, performed on each library, 
where contaminant, low quality, redundant and repeat- 
full sequences were removed and each library assembled. 
This was followed by a multilibrary assembly (described 
below, and summarized in Figure 1). Initially, each read, 
respective quality scores and ancillary information, were 
extracted from the sequencing machine output (.sff), using 
open source software sff extract (http://bioinf.comav.upv. 
es/sff_extract/). Reads of each sample were selected using a 
Python pipeline that screens the reads for primer se- 
quences, classifying them by sample origin and allocating 
them in different files. For each sample we generated a file 
with the sequences (.fasta) and the corresponding file with 
the quality scores (.qual). At this stage we removed adap- 
tors and reads smaller than 40 bp. Thereafter, artificial du- 
plicates associated with pyrosequencing were removed 
using cd-hit-454 [30] at a threshold of 98%, and Seq-trim 
[31] was used to remove small sequences (length < 100 bp) 
or sequences with low quality (QV > 20, quality window = 
10), as well as poly-A or poly-T tails, and adaptors. 

In the following step, contaminant sequences were re- 
moved. For this, a database of possible types of contami- 
nants was prepared (ContaminantsDB - see supplementary 
material for details) and queried with the Q. suber reads 
using BLASTn (5, -E 3 -e le-09 -q -5 -b 1 -G 3). Reads 
that found a match in this database, were subsequently 
blasted against a database of plant proteins (PlantDB - see 
supplementary material for details) using the same parame- 
ters as before. If the hit (match) e-value in ContaminantsDB 
was smaller than hit (match) e-value in Plant DB, the 
read was considered as a contaminant and removed from 
the pipeline. The remaining reads continued in the pipe- 
line to be screened for repetitive elements, using the 
program RepeatMasker 3.2.9 (www.repeatmasker.org) 
against PlantRepeatsDB [32]. Whenever sequences were 
masked in more than 90% of their length they were 
discarded. 

The final step of the preprocessing stage was the 
classification of all the trimmed reads into potential 
mitochondrial, chloroplastidial or nuclear sequences. For 
this, a BLASTn (-e = 0.001) was first performed against a 



database containing coding region sequences from 
complete plant mitochondrial genomes (from Arabidopsis 
thaliana, Medicago truncatula and Populus tricocharpd). 
The sequences that presented a hit were considered 
potential mitochondrial sequences and were kept in a 
FASTA file reserved for this organelle sequences. A simi- 
lar process was then applied against a database of coding 
region sequences of plant complete plastidial genomes 
(same organisms). 

Assembly 

We chose MIRA 3.2.0 [33] to assemble the resulting 
sequences, as this has been shown to have higher 
coverage than other assemblers [34]. For each library, 
we obtained contigs and singletons with the following 
parameters: — job = denovo, est, accurate, 454; — GE: 
not = 20; — SIvnot = 20; 454_SETTINGS -LR:mxti = no, - 
CL:qc = no:cpat = no:mbc = yes, — AL:egp = no:mrs = 85, - 
OUTsssip = yes, -AS:mrpc = 1. Following this step, all the 
contigs and singlets resulting from the assembly of each li- 
brary were then clustered to remove redundancy using 
CD-HiT [35], and the resulting non-redundant sequence 
collection was re-assembled using the same parameters as 
before. The resulting sequences were considered to be 
Unigenes, and at this point they were given an unigene ac- 
cession number. Libraries L20 and L21 were not used in 
the analysis presented in this manuscript, but are available 
in the full assembly on the CorkOakDB. 

Protein prediction 

In order to be able to translate the nucleotide sequences 
to protein sequences, the pipeline first performs a Blast 
search (blastx) against a RNA database [36], to remove 
non-protein coding unigenes. It then queries all Viridi- 
plantae protein sequences existing in the Uniprot database 
[37]. The program Prot4EST [38] then takes the outputs of 
these BLAST searches and translates the sequences into pu- 
tative peptide sequences. Those unigenes without signifi- 
cant hits are translated using the program ESTscan [39], 
and for the remaining untranslated sequences, the longest 
ORF of the 6 frames is selected. 

Sequence naming 

In order to assign names to the genes/proteins found, 
putative peptides were used to query, using BLASTp at a 
cut off of e < 10~ 5 , a database of Uniprot sequences from A. 
thaliana and P. tricocharpa. Whenever a putative peptide 
does not have a hit, it is considered "Predicted hypothetical 
protein". If a similar hit is detected, then the protein name 
is assigned to the putative peptide in Q. suber together with 
a label that describes the level of confidence of the annota- 
tion (see Table 4). 
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Functional annotation 

In order to obtain domains and functional sites of putative 
peptides, an Interpro search was executed [40]. The Inter- 
pro database [41] integrates different classification methods 
based on amino-acid patterns and profiles, protein family 
fingerprints, protein sequences and structural domains, as 
well as functional information. The Interpro database 28.0 
was downloaded and searches were run locally. Afterwards, 
a BLAST (BLASTp) search against non-redundant protein 
database was executed and results entered the program 
Blast2GO [42]. We used the pipeline version of the B2G 
called B2g4pipe, obtaining GO-terms and E.C. Numbers. 
The same pipeline was used to assign Interpro domains for 
the transcriptomes analysed in Figure 5. 

Database implementation 

A MySQL relational database was deployed, using the 
InnoDB engine to allow rollback of transactions in case 
of failure. This was essential, given the progressive na- 
ture of the data loading. Every EST sequence was stored 
in the database, and as each step of the pipeline was ran, 
the results were added to the corresponding tables, up to 
the functional annotation of assembled unigenes, as well 
as metadata related to the EST libraries. Some intermedi- 
ate output data, such as large FASTA and XML files, were 
kept on the file system. The web interface is powered by a 
Python application built on Django (an open source web 
framework), HTML/ CSS and Javascript. KEGG data is dis- 
played using the KEGG SOAP API. 

Accession numbers and unigene naming 

Accession numbers on the corkoakDB have the following 
format QS_000000, for unigenes, and QS_P_000000 for 
putative peptides. Whenever the sequences are putative 
mitochondrial or potential chloroplast sequences they start 
with QSm or QSc, respectively. 

Evolutionary analysis 

Comparisons to other organisms were made using pre- 
dicted proteomes obtained from the superfamily database 
[43] release 1.75. We used BLASTp for the comparisons, 
always filtering for low complexity regions and using the 
cut offs indicated in the text. We used the standard NCBI's 
taxonomic tree as a reference for Figure 6. Red oak librar- 
ies were obtained from the Fagaceae genomics web (www. 
fagaceae.org/node/87455) and processed using our own 
pipeline, resulting in 38,346 predicted unigenes. We then 
used BLASTp with a cut off at e = 0.01 to determine how 
many unigenes from the cork oak were similar to at least 
one unigene in the red oak. 

Availability of supporting data 

All sequenced ESTs were submitted to the sequence 
read archive (http://www.ncbi.nlm.nih.gov/sra) with the 



accession number ERP001762, and accession name 
"Cork Oak". 
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